Predicting SPARQL Query Performance
Abstract
We address the problem of predicting SPARQL query performance. We use machine learning techniques to learn SPARQL query performance from previously executed queries. We show how to model SPARQL queries as feature vectors, and use k-nearest neighbors regression and Support Vector Machine with the nu-SVR kernel to accurately (R value of 0.98526) predict SPARQL query execution time.

1 Query Performance Prediction

The emerging dataspace of Linked Data presents tremendous potential for large-scale data integration over cross-domain data to support a new generation of intelligent applications. In this context, it is increasingly important to develop efficient ways of querying Linked Data. Central to this problem is knowing how a query would behave prior to executing it. The current generation of SPARQL query cost estimation approaches is based on data statistics and heuristics.

Statistics-based approaches have two major drawbacks in the context of Linked Data [9]. First, the statistics (e.g., histograms) about the data are often missing in the Linked Data scenario because they are expensive to generate and maintain. Second, due to the graph-based data model and schema-less nature of RDF data, it is unclear what makes effective statistics for query cost estimation. Heuristics-based approaches generally do not require any knowledge of underlying data statistics. However, they are based on strong assumptions, such as considering queries of a certain structure less expensive than others. These assumptions may hold for some RDF datasets and not for others.

We take a rather pragmatic approach to SPARQL query cost estimation: we learn SPARQL query performance metrics from already executed queries. Recent work [1, 3, 4] in database research shows that database query performance metrics can be accurately predicted without any knowledge of data statistics by applying machine learning techniques on the query logs of already executed queries.
Similarly, we apply machine learning techniques to learn SPARQL query performance metrics from already executed queries. We consider query execution time as the query performance metric in this paper.

2 Modeling SPARQL Query Execution

We predict SPARQL query performance metrics by applying machine learning techniques on previously executed queries. This approach does not require any statistics of the underlying RDF data, which makes it ideal for the Linked Data scenario. We use two types of query features: SPARQL algebra features and graph pattern features. We use the frequencies and cardinalities of the SPARQL algebra operators¹, and the depth of the algebra expression tree, as SPARQL algebra features.

Fig. 1. Example of extracting SPARQL feature vector from a SPARQL query.

Regarding graph pattern features, transforming graph patterns to vector space is not trivial because the space is infinite. To address this, we create a query pattern vector representation relative to the query patterns appearing in the training data. First, we cluster the structurally similar query patterns in the training data into Kgp clusters. The query pattern at the center of a cluster is the representative of the query patterns in that cluster. Second, we represent a query pattern as a Kgp-dimensional vector where the value of a dimension is the structural similarity between that query pattern and the corresponding cluster center query pattern.

To compute the structural similarity between two query patterns, we first construct two graphs from the two query patterns, then compute the approximate graph edit distance between these two graphs, using a suboptimal algorithm [7] with O(n) computational complexity. The structural similarity is the inverse of the approximate edit distance. We use the k-medoids [5] clustering algorithm to cluster the query patterns of the training data.
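The pattern-vector construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `toy_distance` is a hypothetical stand-in for the approximate graph edit distance, the cluster centers are given by hand rather than found by clustering, and the `1 / (1 + d)` form of the similarity is an assumption made here to avoid dividing by zero when a pattern coincides with a cluster center (the text only states that the similarity is the inverse of the distance).

```python
from typing import Callable, Sequence, TypeVar

P = TypeVar("P")  # a query pattern, in whatever graph representation is used


def similarity(distance: float) -> float:
    """Turn a distance into a similarity.

    Assumption: 1 / (1 + d) is used so that identical patterns
    (distance 0) get similarity 1.0 instead of a division by zero.
    """
    return 1.0 / (1.0 + distance)


def pattern_vector(
    pattern: P,
    centers: Sequence[P],
    distance: Callable[[P, P], float],
) -> list:
    """Return a Kgp-dimensional vector, one entry per cluster center."""
    return [similarity(distance(pattern, center)) for center in centers]


def toy_distance(a: frozenset, b: frozenset) -> float:
    """Hypothetical stand-in for approximate graph edit distance:
    the number of edges present in one pattern but not in the other."""
    return float(len(a ^ b))


# Each toy pattern is a set of (subject, object) edges; here Kgp = 2.
centers = [
    frozenset({("?s", "?o")}),
    frozenset({("?s", "?o"), ("?o", "?x")}),
]
query = frozenset({("?s", "?o"), ("?o", "?x")})
vector = pattern_vector(query, centers, toy_distance)
```

Because the vector is defined only through a distance function and a set of representative patterns, any distance that works on graphs can be plugged in without changing the rest of the feature extraction.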
We use k-medoids because it chooses data points as cluster centers and allows using an arbitrary distance function. We use the same suboptimal graph edit distance algorithm as the distance function for k-medoids. Figure 1 shows an example of extracting SPARQL algebra features (left) and graph pattern features (right) from a SPARQL query string.

¹ Algebra operators: http://www.w3.org/TR/sparql11-query/#sparqlAlgebra

3 Experiments and Results

We generate 1260 training, 420 validation, and 420 test queries from the 25 DBPSB benchmark query templates [6]. To generate queries, we assign randomly selected RDF terms from the DBpedia 3.5.1 dataset to the placeholders in the query templates. We run the queries on a Jena-TDB 1.0.0 triple store loaded with DBpedia 3.5.1 and record their query execution time. We exclude queries that do not return any result (queries from templates 2, 16, and 21) and queries that run for more than 300 seconds (queries from template 20).

We experiment with k-nearest neighbors (k-NN) regression [2] and Support Vector Machine (SVM) with the nu-SVR kernel for regression [8] to predict query execution time. We achieve an R value of 0.9654 (Figure 2(a)) and a root mean squared error (RMSE) value of 401.7018 (Figure 2(b)) on the test dataset using k-NN (with Kgp = 10 and k = 2 selected by cross validation). We achieve an improved R value of 0.98526 (Figure 2(c)) and a lower RMSE value of 262.1869 (Figure 2(d)) using SVM (with Kgp = 25 selected by cross validation). This shows that our approach can accurately predict SPARQL query execution time.
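For readers unfamiliar with the evaluation setup, the following sketch shows k-NN regression with k = 2 and the two reported metrics. It rests on assumptions: the feature vectors and execution times are made-up toy values (not data from the experiments), Euclidean distance between feature vectors is assumed, and R is assumed to be the Pearson correlation between actual and predicted times, which the text does not state explicitly.

```python
import math


def knn_predict(train_x, train_y, x, k=2):
    """Predict a target as the mean of the k nearest training targets.

    Assumption: nearness is Euclidean distance between feature vectors.
    """
    nearest = sorted(
        range(len(train_x)),
        key=lambda i: math.dist(train_x[i], x),
    )[:k]
    return sum(train_y[i] for i in nearest) / k


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r(actual, predicted):
    """Pearson correlation coefficient (assumed meaning of the R value)."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)


# Toy data: one-dimensional feature vectors and invented execution times.
train_x = [[0.0], [1.0], [2.0], [3.0]]
train_y = [10.0, 20.0, 30.0, 40.0]
test_x = [[0.4], [2.6], [1.5]]
test_y = [16.0, 33.0, 24.0]

predictions = [knn_predict(train_x, train_y, x, k=2) for x in test_x]
error = rmse(test_y, predictions)
r_value = pearson_r(test_y, predictions)
```

In practice the feature vectors would be the concatenation of the algebra features and the Kgp-dimensional pattern vector, and k and Kgp would be chosen on the validation set as described above.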
Publication date: 2014